Main Questions
Our main questions:
- How have youth disconnection rates among American youth changed since 2008?
- In particular, how has this changed for different gender and ethnic groups? Are any groups particularly disconnected?
MICHAEL ONTIVEROS
There are some existing typos in the tables, in addition to those produced during the use of magick. I made some assumptions along the way, which I have documented; these need to be checked for accuracy.
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study please use:
Wright, Carrie, Ontiveros, Michael, Jager, Leah, Taub, Margaret, and Hicks, Stephanie. (2020). Disparities in Youth Disconnection (Version v1.0.0). https://github.com/opencasestudies/ocs-youth-disconnection-case-study.
According to this report, youth disconnection, although generally showing decreasing trends for the past 7 years, shows racial and ethnic disparities, with some groups showing increased rates of disconnection.
So what does the term “youth disconnection” mean?
According to Measure of America (a nonpartisan project of the nonprofit Social Science Research Council that is focused on opportunity in the United States) disconnected youth are:
“young people between the ages of 16 and 24 who are neither working nor in school”
They state that such disconnection hinders these individuals from acquiring the skills and creating the relationships necessary for a successful adulthood.
They state that:
“people who experience a period of disconnection as young adults go on to earn less and are less likely to be employed, own a home, or report good health by the time they reach their thirties”
Disconnected youth are also referred to as opportunity youth, which has the added positive connotation that supporting such individuals can be beneficial not only for these individuals but also for their communities and for society.
We will expand beyond the Measure of America annual report to take a deeper look at differences between specific groups of youths. Identifying youths who are particularly at risk or disconnected can help inform the design of targeted prevention and reengagement strategies.
This case study is motivated by this article:
Mendelson, T., Mmari, K., Blum, R. W., Catalano, R. F. & Brindis, C. D. Opportunity Youth: Insights and Opportunities for a Public Health Approach to Reengage Disconnected Teenagers and Young Adults. Public Health Rep 133, 54S-64S (2018).
This article describes strategies for prevention of disconnection and reengagement of disconnected youth, and how such interventions could greatly and positively impact opportunity youth for the entire trajectory of their lives and for future generations. It also points out that there are indeed disparities among different racial/ethnic groups.
In this case study, we will demonstrate how to import and wrangle data available in the Portable Document Format (PDF). We will especially focus on using packages and functions from the tidyverse, such as dplyr and ggplot2. The tidyverse is a collection of packages created by RStudio. While some students may be familiar with other R programming packages, the tidyverse packages make data science in R more legible and intuitive.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data science skills:
- Importing images and extracting text with the magick package
- Data wrangling with dplyr
- Reshaping data with tidyr
- Creating plots with ggplot2 that are in a similar style to an existing image

Statistical concepts and methods:
We will begin by loading the packages that we will need:
library(here)
library(tidyverse)
library(pdftools)
library(magick)
library(cowplot)
library(Kendall)

| Package | Use |
|---|---|
| here | to easily load and save data |
| tidyverse | for data science operations |
| pdftools | to manage PDF documents |
| magick | for image processing |
| cowplot | to combine plots |
| Kendall | to test for trends over time |
The first time we use a function, we will use :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we use come from.
So how does youth disconnection happen and what impact does it have?
There are many known risk factors, which have been identified in a variety of contexts (from family, friends, school, community, society) including:
These risk factors make it more likely for young people to miss out on the education, training, and networking that can act as a foundation for a successful career.
There are also many known negative consequences associated with youth disconnection including but not limited to:
Photo by Jon Tyson on Unsplash
Furthermore, in 2012 it was estimated that each disconnected youth costs taxpayers $250,000 during a lifetime due to lost tax revenue and costs for social services, health care, and criminal justice.
Youth disconnection can be described as a continuum: some youths will be disconnected for a brief time, while others are chronically disconnected. Additionally, an individual who is out of school and work and also has poor support from relationships with others may be further disconnected than an individual who has social support.
Here is an illustration of risk factors, protective factors and the continuum of disconnection:
##### [source]
Many programs have identified useful strategies for reengaging disconnected youth or preventing disconnection of youth.
Generally speaking, most programs focus on reengagement strategies; however, prevention strategies are likely to be just as important.
Research suggests that active involvement with at-risk youth from infancy and across multiple developmental stages through young adulthood would be the most beneficial.
In fact, the quality of parental caregiving of infants aged 6-24 months has actually been shown to be a predictor of high school dropout rates! Thus early interventions may be very important, and consistent continual engagement may prevent further disconnection of youths.
Prevention strategies include:
See here and here for listings of programs dedicated to reengaging disconnected youth or preventing disconnection.
See here and here for particular examples.
The statistics used in this section came from this article.
There are some important considerations regarding this data analysis to keep in mind:
The data used in the Measure of America project reports is derived from the American Community Survey (ACS), which excludes or underrepresents certain opportunity youth groups, such as youths in the juvenile justice system, youths in the foster care system, and homeless youths, as the survey is conducted on households. Furthermore, information about youths who may be more disconnected for other reasons besides not being in work or school, such as dealing with the added challenge of being a teenage mother or being abused, is not available in this dataset. Thus, this data likely underestimates youth disconnection rates.
Data about certain group intersections (meaning, for example, individuals of a particular gender and ethnicity) or particular groups in general, such as specific ethnicities or gender and sexual identity groups like LGBTQ (lesbian/gay/bisexual/transgender/queer and questioning) or nonbinary gender populations, is unfortunately not available in the data used in this analysis and in most research about this topic. Luckily, however, recent years of the ACS have more detailed information about a greater number of racial and ethnic groups and racial/ethnic intersections.
The statistical procedures we are using may be overly simplistic. In all data analysis, we need to be wary about deriving meaning from the statistical procedures we use.
Using image processing tools can be very helpful. The manner in which data is obtained with image processing tools is what we would describe as a black box process, a process with known inputs and outputs but unknown mechanics. Because we are unaware of how our outputs are generated from our inputs, we need to be wary of the output. With the small output we are creating in this case study, a visual inspection should suffice.
In this case study we will be using data related to youth disconnection from the two following reports from the Measure of America project:
Measure of America is a nonpartisan project of the nonprofit Social Science Research Council founded in 2007 to create easy-to-use yet methodologically sound tools for understanding well-being and opportunity in America. Through reports, interactive apps, and custom-built dashboards, Measure of America works with partners to breathe life into numbers, using data to identify areas of highest need, pinpoint levers for change, and track progress over time.
The data used in these reports comes from the American Community Survey (ACS), which is the largest survey conducted by the United States Census Bureau. The survey started in 2005 and collects data for 3.5 million households annually. Data is collected about ancestry, citizenship, income, employment, and disability, among many other aspects. See here for more detailed information about the survey.
According to Wikipedia (https://en.wikipedia.org/wiki/American_Community_Survey):
Data is collected by internet, mail, telephone interviews and in-person interviews…About 95 percent of households across all response modes ultimately respond… ACS responses are confidential… and “immune from legal process”
It is a mandatory survey, governed by federal laws that could impose a fine of as much as $5,000 for refusing to participate.
We are particularly interested in the following tables on the last page of the Measure of America 2019 report:
We are particularly interested in the tables on the following pages from the Measure of America 2020 report:
One way to import data from a PDF is to use the pdf_text() function of the pdftools package. The here() function of the here package allows us to easily specify where the document we want to import is located, starting from the directory where a .Rproj file is located. In this case, we will import Making_the_Connection.pdf from the docs directory. Note this is the case if you pull the repository from GitHub.
We can take a look at the output for the page with our table of interest by simply using brackets [] around the page number. The page we are interested in (although called 39 in the report) is the 44th page, which looks like this:
[1] "Youth Disconnection by Gender and by Race and Ethnicity\n I NDI CATOR TA BLE S\n DISCONNECTED YOUTH\nMAJOR RACIAL AND RATE (% ages 16–24) 2017 CHANGE IN RATE\nETHNIC GROUPS 2008 2010 2012 2014 2016 (%) (#) 2010–2017 (%)\nUnited States 12.6 14.7 14.1 13.2 11.7 11.5 4,501,800 -22.1\nMale 12.3 15.2 14.5 13.3 12.1 11.8 2,382,500 -22.5\nFemale 12.9 14.1 13.7 13.0 11.2 11.1 2,119,400 -21.7\nASIAN 7.1 8.5 7.8 7.9 6.6 6.6 145,600 -21.7\nAsian Male 6.3 8.3 7.4 7.2 6.7 6.5 73,000 -21.4\nAsian Female 7.9 8.6 8.1 8.6 6.6 6.7 72,600 -22.0\nWHITE 9.7 11.7 11.2 10.8 9.7 9.4 1,961,700 -20.1\nWhite Male 9.5 12.3 11.5 10.8 10.0 9.6 1,031,200 -22.4\nWhite Female 10.0 11.1 10.8 10.7 9.4 9.1 930,600 -17.4\nLATINO 16.7 18.5 17.3 15.2 13.7 13.2 1,157,300 -28.7\nLatino Male 13.6 16.8 16.0 14.0 12.6 12.4 562,600 -26.0\nLatina Female 20.2 20.3 18.8 16.5 14.8 13.9 594,700 -31.5\nBLACK 20.4 22.5 22.4 20.6 17.2 17.9 999,700 -20.6\nBlack Male 23.7 26.0 25.6 23.5 20.1 20.8 591,600 -19.8\nBlack Female 17.0 19.0 19.3 17.6 14.2 14.8 408,000 -22.1\nNATIVE AMERICAN 24.4 28.8 27.0 26.3 25.8 23.9 67,700 -17.1\nNative American Male 25.0 30.9 28.0 26.9 28.1 23.3 33,200 -24.5\nNative American Female 23.9 26.7 25.9 25.6 23.4 24.5 34,500 -8.4\n 2017 2017\n ASIAN SUBGROUPS (%) (#) LATINO SUBGROUPS (%) (#)\n United States 11.5 4,501,800\n Male 11.8 2,382,500 LATINO 13.2 1,157,300\n Female 11.1 2,119,400 Latino Male 12.4 562,600\n ASIAN 6.6 145,600 Latina Female 13.9 594,700\n Asian Male 6.5 73,000 SOUTH AMERICAN 8.4 37,600\n Asian Female 6.7 72,600 South American Male 9.1 20,400\n CHINESE 4.3 23,800 South American Female 7.7 17,200\n Chinese Male 4.7 12,700 CENTRAL AMERICAN 12.0 93,100\n Chinese Female 3.9 11,100 Central American Male 9.3 37,900\n VIETNAMESE 5.5 13,500 Central American Female 15.0 55,200\n Vietnamese Male 7.5 9,300\n MEXICAN 13.3 762,400\n Vietnamese Female 3.4 4,200\n Mexican Male 12.2 358,200\nINDIAN 5.9 22,300\n Mexican Female 14.4 404,200\nIndian Male 4.1 8,000 OTHER LATINO 13.6 
44,800\nIndian Female 7.8 14,300\n Other Latino Male 15.3 27,600\n PAKISTANI 6.4 4,900\n Other Latina Female 11.5 17,300\n Pakistani Male\n PUERTO RICAN, DOMINICAN, CUBAN 15.1 211,200\n Pakistani Female\n PR, DR, Cuban Female 15.7 114,500\n KOREAN 6.5 11,200\n PR, DR, Cuban Female 14.4 96,600\n Korean Male 8.0 6,900\n Korean Female 5.0 4,200\n NOTE: Blank cells indicate the estimate is unreliable\n TWO OR MORE 6.6 4,000\n Two or More Male\n Two or More Female\nFILIPINO 7.3 23,400\nFilipino Male 6.5 10,800\nFilipino Female 8.1 12,700\nHMONG 14.0 8,300\nHmong Male 18.6 5,700\nHmong Female\nMAKING THE CONNECTION | Transportation and Youth Disconnection 39\n"
From the output, it’s clear that a relatively large amount of manipulation will be required to wrangle this data. If you are interested in learning more about this method, please see this case study and this case study.
While not impossible, using the pdftools package in this scenario will require some advanced data wrangling.
While our output may be reproducible, this process may be too time consuming.
Fortunately, there is another way we can proceed to wrangle the data.
We will demonstrate how to produce reproducible tables with image processing software in R using a package called magick which allows for the extraction of text from images.
For demonstrative purposes, we will import two sets of data. The first set of data will be used to highlight common errors that the image processing software may produce. The second set of data will be used to demonstrate how to circumvent these errors and produce reproducible datasets efficiently.
We will now import the data using the magick package, which allows for the importation of images.
First we will take a screenshot of the top part of the gender, race, and ethnicity table on the last page of the 2019 Measure of America Report.
We can show what this file looks like in this rendered rmarkdown website by using the include_graphics() function of the knitr package.
Now, we will use the image_read() function of the magick package to import this image.
We can then use the image_info() function to make sure that the import worked and to get information about the size, format, and color of the image.
image_example <- magick::image_read(here::here("img", "gender_race_ethnicity.png"))
magick::image_info(image_example)
# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 997 711 sRGB TRUE 164964 72x72
Now let’s take a look at our image in R! Now that we have imported it, we simply need to type the name of the image to see it.
Nice!
We will demonstrate in a bit that the top part of the table causes issues when extracting the text from this image. So now we will take a screen shot without the top part of the table and do the same process.
cropped_table_gen_race_eth_2018 <- image_read(here("img", "gender_race_ethnicity2.png"))
cropped_table_gen_race_eth_2018

Let’s import one more image just for fun. Here we will import an image directly from a URL.
ggplot2_logo <- image_read("https://d33wubrfki0l68.cloudfront.net/2c6239d311be6d037c251c71c3902792f8c4ddd2/12f67/css/images/hex/ggplot2.png")
ggplot2_logo

Now we will use the image_ocr() function of the magick package to extract the text from the ggplot2 logo image. This function uses the tesseract package, which has tools for optical character recognition (OCR), hence the ocr in the function name. This allows the function to identify text in images. These OCR tools have often been developed using machine learning, in which an algorithm is trained on images with and without text to “learn” to recognize text. See here to learn more about how OCR works.
[1] "ggplot2\n"
Awesome! We were able to extract text from this hex sticker!
One thing to keep in mind is that this doesn’t always work. Unusual fonts, angled text, or particular colors can be difficult for the OCR to recognize.
Here is an example that does not work with the current version of magick:
[1] ""
This is likely due to the background on this particular hex sticker.
Now let’s try extracting the text from our image files.
The first image we imported looks like this.
Now we will extract the text!
[1] "Youth Disconnection by Gender and by Race and Ethnicity\nbyte) ee\nMAJOR RACIAL AND Cr erent) ed Nas\nUnited States 126 147 14.1 13.21.7115 4,501,800 22.1\nMale 12.3 15.2 14.5 13.3 12.1 11.8 2,382,500 -22.5\nFemale 12.9 14.1 13.7 13.0 11.2 11.1 2,119,400 -21.7\nASIAN 71 85 78 19 66 66 145,600 -21.7\nAsian Male 6.3 8.3 74 7.2 6.7 65 73,000 -21.4\nAsian Female 7.9 8.6 8.1 8.6 6.6 6.7 72,600 -22.0\nWHITE 97 210897 9.4 1,961,700 -20.1\nWhite Male 95 12.3 11.5 10.8 10.0 9.6 1,031,200 -22.4\nWhite Female 10.0 11.1 10.8 10.7 9.4 91 930,600 -17.4\nLATINO 167-185 17.3 15.2 13.7 =~ 138.2_—_—‘1,157,300 -28.7\nLatino Male 13.6 16.8 16.0 14.0 12.6 12.4 562,600 -26.0\nLatina Female 20.2 20.3 18.8 16.5 14.8 13.9 594,700 -31.5\nBLACK 20.4 225 224 206 17.2 17.9 999,700 -20.6\nBlack Male 23.7 26.0 25.6 23.5 20.1 20.8 591,600 -19.8\nBlack Female 17.0 19.0 19.3 17.6 14.2 14.8 408,000 -22.1\nNATIVE AMERICAN 264 288 27.0 23 28 23.9 67,700 -17.1\nNative American Male 25.0 30.9 28.0 26.9 28.1 23.3 33,200 -24.5\nNative American Female 23.9 26.7 25.9 25.6 23.4 24.5 34,500 -B.4\n"
This looks like it worked fairly well!
You may notice that there are lots of \n values in the text from our image. These are newline characters, which denote the end of a line of text and the start of a new line of text.
We can use the str_split() function of the stringr package to split based on the \n characters in the output. We will then unlist the output using the base R unlist() function. By base, we mean that the function is loaded automatically in an R session. Finally, we will use the as_tibble() function of the tibble package to convert the data into tibble format, which is the tidyverse version of a data frame. This will allow us to see the values in the table much better.
To do all of these sequential steps efficiently we will use a method called piping.
Click here if you are unfamiliar with piping in R, which uses the %>% pipe operator. This operator is accessible after loading the tidyverse or several of the packages within the tidyverse (like dplyr) because they load the magrittr package. It allows us to perform multiple sequential steps on one data input.
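As a minimal, self-contained sketch of these steps, using a made-up two-line string in place of the OCR output:

```r
library(tidyverse)

# Hypothetical stand-in for the OCR output: one string with embedded newlines
text <- "United States 12.6 14.7\nMale 12.3 15.2"

rows <- text %>%
  str_split("\n") %>%  # split into a list containing the separate lines
  unlist() %>%         # flatten the list to a character vector
  as_tibble()          # one row per line, in a column named `value`

rows
```

Each step's output is piped directly into the next, so no intermediate objects are needed.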
# A tibble: 22 x 1
value
<chr>
1 Youth Disconnection by Gender and by Race and Ethnicity
2 byte) ee
3 MAJOR RACIAL AND Cr erent) ed Nas
4 United States 126 147 14.1 13.21.7115 4,501,800 22.1
5 Male 12.3 15.2 14.5 13.3 12.1 11.8 2,382,500 -22.5
6 Female 12.9 14.1 13.7 13.0 11.2 11.1 2,119,400 -21.7
7 ASIAN 71 85 78 19 66 66 145,600 -21.7
8 Asian Male 6.3 8.3 74 7.2 6.7 65 73,000 -21.4
9 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7 72,600 -22.0
10 WHITE 97 210897 9.4 1,961,700 -20.1
# … with 12 more rows
OK, not bad. The top looks a bit strange, but the rest of the table looks fairly good. However, some rows look particularly strange, like the row that starts with LATINO, and you may notice that the row for Native American females ends with -B.4:
[[1]]
[1] "LATINO" "167-185" "17.3"
[4] "15.2" "13.7" "=~"
[7] "138.2_—_—‘1,157,300" "-28.7"
[[1]]
[1] "Native" "American" "Female" "23.9" "26.7" "25.9"
[7] "25.6" "23.4" "24.5" "34,500" "-B.4"
Data wrangling is not an exact science. The approaches we can take are extremely dependent on the data. We can exploit patterns in the data to render the output we desire.
The first few lines of our table have quite a bit of special formatting; there are different font colors and backgrounds. As we saw previously, this can sometimes cause issues. So now we will try using the cropped version of the image.
df1 <- image_ocr(cropped_table_gen_race_eth_2018)
df1 <- df1 %>%
str_split("\n") %>%
unlist() %>%
as_tibble()
df1
# A tibble: 19 x 1
value
<chr>
1 "United States 12.6 14.7 14.1 13.2 11.7 11.5 4,501,800 -22.1"
2 "Male 12.3 15.2 14.5 13.3 12.1 11.8 2,382,500 -22.5"
3 "Female 12.9 14.1 13.7 13.0 11.2 11.1 2,119,400 -21.7"
4 "ASIAN 71 8.5 7.8 79 6.6 6.6 145,600 -21.7"
5 "Asian Male 63 8.3 74 7.2 67 6.5 73,000 -21.4"
6 "Asian Female 79 8.6 8.1 8.6 6.6 67 72,600 -22.0"
7 "WHITE 9.7 11.7 11.2 10.8 9.7 94 1,961,700 -20.1"
8 "White Male 95 12.3 11.5 10.8 10.0 9.6 1,031,200 -22.4"
9 "White Female 10.0 11.1 10.8 10.7 9.4 9.1 930,600 -17.4"
10 "LATINO 16.7 18.5 17.3 15.2 13.7 13.2 1,157,300 -28.7"
11 "Latino Male 13.6 16.8 16.0 14.0 12.6 12.4 562,600 -26.0"
12 "Latina Female 20.2 20.3 18.8 16.5 14.8 13.9 594,700 -31.5"
13 "BLACK 20.4 22.5 22.4 20.6 17.2 17.9 999,700 -20.6"
14 "Black Male 23.7 26.0 25.6 23.5 20.1 20.8 591,600 -19.8"
15 "Black Female 17.0 19.0 19.3 17.6 14.2 14.8 408,000 -22.1"
16 "NATIVE AMERICAN 24.4 28.8 27.0 26.3 25.8 23.9 67,700 -17.1"
17 "Native American Male 25.0 30.9 28.0 26.9 28.1 23.3 33,200 -24.5"
18 "Native American Female 23.9 26.7 25.9 25.6 23.4 24.5 34,500 -8.4"
19 ""
This looks much better!
It’s important to look very carefully at the text. Although we no longer have the previous issues, there are some values missing a decimal point. For example, in the row that starts with ASIAN, the first value is missing a decimal point.
Now we will try using separate images of just the names and the data. We only want data about the percent for each year, so we can exclude the last few columns in this image. The larger an image is and the better its resolution, the more likely it is that the text will be extracted correctly.
ethnic_groups <- image_read(here::here("img", "first_table_ethnic_groups.png"))
years <- image_read(here::here("img", "first_table_years.png"))
ethnic_groups

Now let’s try extracting the text from these images and separating the rows by the newline character \n.
Question Opportunity
Can you recall the commands to do this?
Click here to reveal the code.
# A tibble: 33 x 1
value
<chr>
1 "United States"
2 ""
3 "Male"
4 ""
5 "Female"
6 ""
7 "ASIAN"
8 ""
9 "Asian Male"
10 ""
# … with 23 more rows
# A tibble: 19 x 1
value
<chr>
1 "12.6 14.7 14.1 13.2 11.7 11.5"
2 "12.3 15.2 14.5 13.3 12.1 11.8"
3 "12.9 14.1 13.7 13.0 11.2 11.1"
4 "7.1 8.5 7.8 79 6.6 6.6"
5 "6.3 8.3 74 7.2 6.7 6.5"
6 "7.9 8.6 8.1 8.6 6.6 6.7"
7 "9.7 11.7 11.2 10.8 9.7 9.4"
8 "9.5 12.3 11.5 10.8 10.0 9.6"
9 "10.0 11.1 10.8 10.7 9.4 9.1"
10 "16.7 18.5 17.3 15.2 13.7 13.2"
11 "13.6 16.8 16.0 14.0 12.6 12.4"
12 "20.2 20.3 18.8 16.5 14.8 13.9"
13 "20.4 22.5 22.4 20.6 17.2 17.9"
14 "23.7 26.0 25.6 23.5 20.1 20.8"
15 "17.0 19.0 19.3 17.6 14.2 14.8"
16 "24.4 28.8 27.0 26.3 25.8 23.9"
17 "25.0 30.9 28.0 26.9 28.1 23.3"
18 "23.9 26.7 25.9 25.6 23.4 24.5"
19 ""
OK! This looks pretty good! The only issue is that we have some extra rows in between each row in the ethnic_groups object.
We can remove these extra rows using the filter() function of the dplyr package to exclude all rows that contain just an empty string "" using the not-equals operator !=.
We are also going to use a special pipe operator from the magrittr package called the compound assignment pipe-operator or sometimes the double pipe operator.
This allows us to use the ethnic_groups as our input and reassign it at the end after all the steps have been performed.
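A small sketch of both ideas together, using a hypothetical four-row stand-in for ethnic_groups:

```r
library(tidyverse)
library(magrittr)  # provides the compound assignment pipe %<>%

# Hypothetical miniature version of the ethnic_groups tibble,
# with the empty rows the OCR step produced
ethnic_groups <- tibble(value = c("United States", "", "Male", ""))

# %<>% pipes ethnic_groups into the steps and assigns the result back to it,
# so this is equivalent to: ethnic_groups <- ethnic_groups %>% filter(...)
ethnic_groups %<>% filter(value != "")

ethnic_groups
```

Note that %<>% comes from magrittr itself; loading the tidyverse attaches %>% but not %<>%, so magrittr is loaded explicitly here.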
# A tibble: 18 x 1
value
<chr>
1 United States
2 Male
3 Female
4 ASIAN
5 Asian Male
6 Asian Female
7 WHITE
8 White Male
9 White Female
10 LATINO
11 Latino Male
12 Latina Female
13 BLACK
14 Black Male
15 Black Female
16 NATIVE AMERICAN
17 Native American Male
18 Native American Female
Let’s also get rid of the all caps for the major categories. We can convert the words to only capitalize the first letter using the str_to_title() function of the stringr package.
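As a quick illustration of what str_to_title() does to one of the all-caps group names:

```r
library(stringr)

# str_to_title() capitalizes only the first letter of each word
str_to_title("NATIVE AMERICAN")
```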
# A tibble: 18 x 1
value
<chr>
1 United States
2 Male
3 Female
4 Asian
5 Asian Male
6 Asian Female
7 White
8 White Male
9 White Female
10 Latino
11 Latino Male
12 Latina Female
13 Black
14 Black Male
15 Black Female
16 Native American
17 Native American Male
18 Native American Female
Nice! That looks better.
For the year data we would like to try splitting the strings for each row into different columns based on a space. Currently all the data is listed in one column called value.
We can use the separate() function of the tidyr package to do this. This will allow us to not only split the rows by spaces, but also to create column names.
There are three important arguments for the separate() function:
- col - this specifies what column you are separating
- into - this specifies the names of the new columns you are creating
- sep - this specifies what character string to look for to separate by
years %<>%
tidyr::separate(col = value,
into = c("2008", "2010",
"2012", "2014",
"2016", "2017"),
sep = " ")
years
# A tibble: 19 x 6
`2008` `2010` `2012` `2014` `2016` `2017`
<chr> <chr> <chr> <chr> <chr> <chr>
1 "12.6" 14.7 14.1 13.2 11.7 11.5
2 "12.3" 15.2 14.5 13.3 12.1 11.8
3 "12.9" 14.1 13.7 13.0 11.2 11.1
4 "7.1" 8.5 7.8 79 6.6 6.6
5 "6.3" 8.3 74 7.2 6.7 6.5
6 "7.9" 8.6 8.1 8.6 6.6 6.7
7 "9.7" 11.7 11.2 10.8 9.7 9.4
8 "9.5" 12.3 11.5 10.8 10.0 9.6
9 "10.0" 11.1 10.8 10.7 9.4 9.1
10 "16.7" 18.5 17.3 15.2 13.7 13.2
11 "13.6" 16.8 16.0 14.0 12.6 12.4
12 "20.2" 20.3 18.8 16.5 14.8 13.9
13 "20.4" 22.5 22.4 20.6 17.2 17.9
14 "23.7" 26.0 25.6 23.5 20.1 20.8
15 "17.0" 19.0 19.3 17.6 14.2 14.8
16 "24.4" 28.8 27.0 26.3 25.8 23.9
17 "25.0" 30.9 28.0 26.9 28.1 23.3
18 "23.9" 26.7 25.9 25.6 23.4 24.5
19 "" <NA> <NA> <NA> <NA> <NA>
Looks pretty good!
We appear to have an empty row at the very end. Since all the values are NA, we can use the drop_na() function of the tidyr package to remove it.
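A minimal sketch of this step, using a hypothetical two-row tibble in place of years (the second row mimics the trailing empty line from the OCR output):

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in: row 2 contains an NA, like our trailing empty row
df <- tibble(`2008` = c("12.6", ""), `2010` = c("14.7", NA))

# drop_na() removes every row that contains at least one NA value
drop_na(df)
```

Note that drop_na() only drops rows with NA, not rows with empty strings; in our case the trailing row is removed because all of its separated columns except the first are NA.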
# A tibble: 18 x 6
`2008` `2010` `2012` `2014` `2016` `2017`
<chr> <chr> <chr> <chr> <chr> <chr>
1 12.6 14.7 14.1 13.2 11.7 11.5
2 12.3 15.2 14.5 13.3 12.1 11.8
3 12.9 14.1 13.7 13.0 11.2 11.1
4 7.1 8.5 7.8 79 6.6 6.6
5 6.3 8.3 74 7.2 6.7 6.5
6 7.9 8.6 8.1 8.6 6.6 6.7
7 9.7 11.7 11.2 10.8 9.7 9.4
8 9.5 12.3 11.5 10.8 10.0 9.6
9 10.0 11.1 10.8 10.7 9.4 9.1
10 16.7 18.5 17.3 15.2 13.7 13.2
11 13.6 16.8 16.0 14.0 12.6 12.4
12 20.2 20.3 18.8 16.5 14.8 13.9
13 20.4 22.5 22.4 20.6 17.2 17.9
14 23.7 26.0 25.6 23.5 20.1 20.8
15 17.0 19.0 19.3 17.6 14.2 14.8
16 24.4 28.8 27.0 26.3 25.8 23.9
17 25.0 30.9 28.0 26.9 28.1 23.3
18 23.9 26.7 25.9 25.6 23.4 24.5
Great! Now we only have 18 rows.
Now let’s make these values numeric. Currently we can tell that they are character strings based on the <chr> values listed under each column name.
Click here for an explanation of what a character string is
There are several classes of data in R programming. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3". If data are of class character, then the numeric values will not be processed like numeric values in a mathematical sense. If you want your numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).
To convert our values to be numeric, we can use the base as.numeric() function. To apply this to all the columns in years, we can use the map_df() function of the purrr package. The map() function would also work; however, the special map_df() function keeps the output in the same tibble format. A . is necessary to tell the map_df() function to apply the as.numeric() function to the values of the years tibble.
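A small self-contained sketch of this conversion, with a hypothetical two-column stand-in for years:

```r
library(purrr)
library(tibble)

# Hypothetical miniature years tibble with character columns
years <- tibble(`2008` = c("12.6", "7.1"), `2010` = c("14.7", "8.5"))

# map_df() applies as.numeric() to each column and returns a tibble
years <- map_df(years, ~ as.numeric(.))

years
```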
# A tibble: 18 x 6
`2008` `2010` `2012` `2014` `2016` `2017`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12.6 14.7 14.1 13.2 11.7 11.5
2 12.3 15.2 14.5 13.3 12.1 11.8
3 12.9 14.1 13.7 13 11.2 11.1
4 7.1 8.5 7.8 79 6.6 6.6
5 6.3 8.3 74 7.2 6.7 6.5
6 7.9 8.6 8.1 8.6 6.6 6.7
7 9.7 11.7 11.2 10.8 9.7 9.4
8 9.5 12.3 11.5 10.8 10 9.6
9 10 11.1 10.8 10.7 9.4 9.1
10 16.7 18.5 17.3 15.2 13.7 13.2
11 13.6 16.8 16 14 12.6 12.4
12 20.2 20.3 18.8 16.5 14.8 13.9
13 20.4 22.5 22.4 20.6 17.2 17.9
14 23.7 26 25.6 23.5 20.1 20.8
15 17 19 19.3 17.6 14.2 14.8
16 24.4 28.8 27 26.3 25.8 23.9
17 25 30.9 28 26.9 28.1 23.3
18 23.9 26.7 25.9 25.6 23.4 24.5
We are now ready to put the ethnic_groups and years objects together.
We can do so using the bind_cols() function of the dplyr package.
To rename the column name of the value variable of the disconnection tibble we will use the rename() function of the dplyr package.
The new name needs to be listed first before the old name.
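A minimal sketch of the rename step, using a hypothetical stand-in for the disconnection tibble:

```r
library(dplyr)
library(tibble)

# Hypothetical miniature disconnection tibble with the default column name
disconnection <- tibble(value = c("United States", "Male"))

# rename(new_name = old_name): the new name is listed first
disconnection <- rename(disconnection, Group = value)

disconnection
```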
From Michael:
We split the dataframe in two: a labels section and a “data” section containing the information we are interested in.
In the first half, we remove all digits and punctuation to ensure that we are left with character labels.
In the second half, we remove commas and periods, convert the resulting character strings to numeric, and selectively multiply columns to reintroduce the decimal point correctly.
df1 <- df1 %>%
pull(value) %>%
str_split(" ")
df1a <- df1 %>%
purrr::map(~base::paste(.,collapse = "")) %>%
purrr::map(~base::gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
base::do.call(base::rbind,.) %>%
base::data.frame() %>%
dplyr::tibble()
df1b <- map(df1, tail, 8) %>%
map(~gsub("[,]+|[.]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
dplyr::mutate_if(base::is.character, base::as.numeric) %>%
dplyr::mutate_at(vars(-X7), ~ . * 0.1) %>%
tibble()
base::rm(df1)

We combine the two sections of data to create our dataframe, removing and then adding column names.
df1 <- dplyr::bind_cols(df1a,
df1b)
base::names(df1) <- c()
column_names <- c("Group",
"Perc_2008",
"Perc_2010",
"Perc_2012",
"Perc_2014",
"Perc_2016",
"Perc_2017",
"N_2017",
"Delta_perc")
base::colnames(df1) <- column_names

We remove columns with information we don’t need and use the common pattern in the column names to convert the data into long/narrow format.
df1 <- df1 %>%
dplyr::select(-N_2017,
-Delta_perc)
df1 <- df1 %>%
tidyr::pivot_longer(cols=contains("Perc_"),
names_to = "Year",
values_to = "Rate",
names_prefix = "Perc_") %>%
dplyr::mutate(Year = as.numeric(Year))

We use the dplyr::case_when() and stringr::str_detect() functions to detect patterns and create separate columns with gender and race information.
The two columns created contain TRUE/FALSE statements. These are then used to create a third column that will allow us to separate the data by its summary level.
# A tibble: 114 x 3
Group Year Rate
<chr> <dbl> <dbl>
1 UnitedStates 2008 12.6
2 UnitedStates 2010 14.7
3 UnitedStates 2012 14.1
4 UnitedStates 2014 13.2
5 UnitedStates 2016 11.7
6 UnitedStates 2017 11.5
7 Male 2008 12.3
8 Male 2010 15.2
9 Male 2012 14.5
10 Male 2014 13.3
# … with 104 more rows
disconnection %<>%
mutate(Race = recode(Group, "United States" = "All_races",
"Female" = "All_races",
"Male" = "All_races")) %>%
mutate(Race = str_remove(Race, pattern = " Female| Male"))
disconnection %<>%
mutate(Gender = str_extract(Group, "Female|Male")) %>%
mutate(Gender = replace_na(Gender, replace = "Gender_Total"))
#### This variable seems unnecessary unless we facet by it later. It is used in a figure to get only the race total groups; it is also coded this way for the Asian subgroups.
# This could easily be done instead with some filter steps like this:
# disconnection %>%
#   filter(Gender == "Gender_Total",
#          Group != "United States")
# We could then filter by gender, e.g. males of each race:
# disconnection %>%
#   filter(Gender == "Male",
#          Race != "All_races")
# disconnection %<>%
# mutate(Type = case_when(
# str_detect(Group, "United States") ~ "US Total",
# str_detect(Group, "Female") ~ "Gender Total",
# str_detect(Group, "Male") ~ "Gender Total",
# str_detect(Group, " Female| Male") ~ "Subgroup Total",
# TRUE ~ "Race Total"))dis_long <- disconnection %>%
tidyr::pivot_longer(cols = contains("20"),
names_to = "Year",
values_to = "Rate",
names_prefix = "Perc_") %>%
dplyr::mutate(Year = as.numeric(Year))
df1 <- df1 %>%
mutate(Gender = dplyr::case_when(
stringr::str_detect(Group, "Female") ~ TRUE,
stringr::str_detect(Group, "Male") ~ FALSE,
TRUE ~ NA),
Race = stringr::str_remove_all(Group,
pattern = paste(c("Female","Male","UnitedStates"),
collapse = "|"))) %>%
mutate(Race = dplyr::na_if(Race, ""))
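As a quick illustration of the dplyr::na_if() step above, values equal to the second argument are replaced with NA (the strings here are just for demonstration):

```r
library(dplyr)

# Empty strings left over after stripping the gender/country labels become NA
na_if(c("ASIAN", "", "LATINO"), "")
# returns c("ASIAN", NA, "LATINO")
```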
df1 <- df1 %>%
mutate(Type = case_when(base::is.na(Gender) &
base::is.na(Race) ~ "US Total",
base::is.na(Gender) &
!base::is.na(Race) ~ "Race Total",
!base::is.na(Gender) &
base::is.na(Race) ~ "Gender Total",
TRUE ~ "Subgroup Total"))
df1 <- df1 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ "Female",
str_detect(Group, "Male") ~ "Male"))The columns of character class currently contain upper and lower case characters. We want to ensure that we use a cases consistently to ensure that there are no errors driven by case-sensitive function downstream.
To do this, we use the base::to_upper() function. This function makes all characters uppercase.
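A minimal sketch of this case normalization on a toy tibble (the values here are made up for illustration):

```r
library(dplyr)

df_demo <- tibble(Race = c("NativeAmerican", "Latino"))

# Uppercase every character column so downstream case-sensitive
# comparisons (e.g. filtering on Race == "NATIVEAMERICAN") behave consistently
df_demo %>% mutate_if(is.character, toupper)
```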
Finally, we homogenize the labels assigned for certain groups, fill in missing (NA) values with a string, and remove columns we no longer need from the dataframe.
df1 <- df1 %>%
mutate(Race = case_when(Race == "LATINA" ~ "LATINO/A",
Race == "LATINO" ~ "LATINO/A",
Race == "NATIVEAMERICAN" ~ "NATIVE AMERICAN",
TRUE ~ Race)) %>%
mutate(Gender = tidyr::replace_na(Gender, "ALL"),
Race = tidyr::replace_na(Race, "ALL")) %>%
dplyr::select(-Group)
We can repeat this process for the other two tables listed on page 39.
Let’s look at the table without the special formatting.
As you can see, some cells are empty. According to the PDF, these cells are left empty to denote that the estimates are unreliable.
This may cause problems for OCR, since the whitespace would need to be handled differently. For this reason, we may not want to process the entire image at once.
Instead, we can use separate images of the populated regions, allowing a simpler process like the one above.
We read the three images.
image2a <- image_read(here("img", "asian_subgroupsA.png"))
image2b <- image_read(here("img", "asian_subgroupsB.png"))
image2c <- image_read(here("img", "asian_subgroupsC.png"))
magick::image_info(image2a)
# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 757 687 sRGB TRUE 95660 72x72
# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 751 177 sRGB TRUE 28913 72x72
# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 760 216 sRGB TRUE 33402 72x72
The images look like this.
We save the text from the images into objects.
df2a <- magick::image_ocr(image2a)
df2b <- magick::image_ocr(image2b)
df2c <- magick::image_ocr(image2c)
We process these objects separately. Note that we use a very similar process to that employed in the wrangling of the previous table.
df2a <- df2a %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df2b <- df2b %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df2c <- df2c %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
We then combine the objects with the dplyr::bind_rows() function.
The process is now very similar to the previous table.
MICHAEL ONTIVEROS
I used base R to remove the first three rows of a dataframe in the following chunk. I am not aware of a tidyverse solution for this; I am sure one exists.
df2 <- bind_rows(df2a,
df2b,
df2c)
df2 <- df2 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")
df2b <- map(df2, tail, 2) %>%
map(~gsub("[,]+|[.]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
mutate_if(is.character, as.numeric) %>%
mutate_at(vars(-X2), ~ . * 0.1) %>%
tibble()
df2a <- df2 %>%
map(~paste(.,collapse = "")) %>%
map(~gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
tibble()
df2 <- bind_cols(df2a, df2b)
names(df2) <- c()
column_names <- c("Group",
"Rate",
"N_2017")
colnames(df2) <- column_names
df2 <- df2 %>%
dplyr::select(-N_2017)
df2 <- df2 %>%
dplyr::mutate(Year = 2017)
df2 <- df2 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ TRUE,
str_detect(Group, "Male") ~ FALSE,
TRUE ~ NA),
Subgroup = str_remove_all(Group,
pattern = paste(c("Female",
"Male",
"ASIAN",
"Asian"),
collapse = "|"))) %>%
mutate(Subgroup = na_if(Subgroup, ""))
glimpse(df2)
Rows: 25
Columns: 5
$ Group <chr> "UnitedStates", "Male", "Female", "ASIAN", "AsianMale", "Asi…
$ Rate <dbl> 11.5, 11.8, 11.1, 6.6, 6.5, 6.7, 4.3, 4.7, 3.9, 5.5, 7.5, 3.…
$ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> "UnitedStates", NA, NA, NA, NA, NA, "CHINESE", "Chinese", "C…
df2 <- df2[-(1:3),]
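As an aside on the author’s note above, a tidyverse equivalent of this base R row removal is dplyr::slice() with negative indices (a sketch on a toy tibble):

```r
library(dplyr)

df_demo <- tibble(x = 1:5)

# Negative indices drop rows; this removes the first three rows
df_demo %>% slice(-(1:3))
```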
df2 <- df2 %>%
mutate(Type = case_when(is.na(Gender) & is.na(Subgroup) ~ "Asian Total",
is.na(Gender) & !is.na(Subgroup) ~ "Subgroup Total",
!is.na(Gender) & is.na(Subgroup) ~ "Gender Total",
TRUE ~ "Subgroup Total"))
glimpse(df2)
Rows: 22
Columns: 6
$ Group <chr> "ASIAN", "AsianMale", "AsianFemale", "CHINESE", "ChineseMale…
$ Rate <dbl> 6.6, 6.5, 6.7, 4.3, 4.7, 3.9, 5.5, 7.5, 3.4, 5.9, 4.1, 7.8, …
$ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> NA, NA, NA, "CHINESE", "Chinese", "Chinese", "VIETNAMESE", "…
$ Type <chr> "Asian Total", "Gender Total", "Gender Total", "Subgroup Tot…
df2 <- df2 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ "Female",
str_detect(Group, "Male") ~ "Male"))
df2 <- df2 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, toupper)
df2 <- df2 %>%
mutate(Gender = replace_na(Gender, "ALL"),
Subgroup = replace_na(Subgroup, "ALL")) %>%
dplyr::select(-Group) %>%
mutate(Subgroup)
df2 <- df2 %>%
mutate(Subgroup = case_when(Subgroup == "TWOORMORE" ~ "TWO OR MORE",
TRUE ~ Subgroup))
Note that we took a process that had worked successfully and modified it slightly for separate, similarly sourced data.
This is a common approach in data science. The duration of the wrangling process can limit the depth of an analysis for practical reasons, so reusing tried methods helps reduce the time needed to wrangle data and leaves time for other parts of an analysis.
Let’s add the 2018 data for this group.
We import the image.
As you can see, we have repeated newlines (\n). We can remove these with some simple regex.
df2_2018 <- gsub('([\n])\\1+',
'\\1',
df2_2018)
df2_2018 <- gsub("[[:punct:]]+",
"",
df2_2018)
df2_2018 <- gsub(" i ",
"",
df2_2018)
We proceed, making slight modifications to the process as needed.
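The backreference pattern used above collapses any run of repeated newlines into a single one. A quick illustration on a toy string:

```r
# "([\n])" captures a newline; "\\1+" matches one or more repeats of it,
# so the whole run is replaced by a single newline
gsub('([\n])\\1+', '\\1', "a\n\n\nb")
# returns "a\nb"
```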
The bold font appears to have caused some OCR typos.
# A tibble: 17 x 1
value
<chr>
1 CHINESE 41 23300
2 Men 45 12500
3 Women 37 10800
4 INDIAN B44 21800
5 Men 47 10400
6 Women 61 11300
7 KOREAN 55 9000
8 Men 56 4700
9 Women 54 4300
10 VIETNAMESE 63 15300
11 Men 76 9000
12 Women 50 6400
13 FILIPINO 68 20800
14 Men 63 10000
15 Women74 10800
16 HMONG102 5300
17 CAMBODIAN 138 4200
We fix the typos.
df2_2018 <- df2_2018 %>%
strsplit("\n") %>%
unlist() %>%
gsub(" B44 ","54",.) %>%
gsub("Women74","Women 74",.) %>%
gsub("HMONG102","HMONG 102",.) %>%
as_tibble()
df2_2018 %>%
print(.,n = dim(.)[1])
# A tibble: 17 x 1
value
<chr>
1 CHINESE 41 23300
2 Men 45 12500
3 Women 37 10800
4 INDIAN 54 21800
5 Men 47 10400
6 Women 61 11300
7 KOREAN 55 9000
8 Men 56 4700
9 Women 54 4300
10 VIETNAMESE 63 15300
11 Men 76 9000
12 Women 50 6400
13 FILIPINO 68 20800
14 Men 63 10000
15 Women 74 10800
16 HMONG 102 5300
17 CAMBODIAN 138 4200
We then continue as we would normally.
df2_2018 <- df2_2018 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")
df2_2018 <- lapply(df2_2018,
function(x) x[nchar(x) >= 1])
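The nchar() filter above drops the empty strings that splitting on single spaces leaves behind wherever two spaces were adjacent. A small illustration:

```r
# An OCR row split on " " can contain "" tokens; keep only non-empty ones
tokens <- list(c("KOREAN", "", "55", "9000"))
lapply(tokens, function(x) x[nchar(x) >= 1])
# returns list(c("KOREAN", "55", "9000"))
```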
df2a_2018 <- df2_2018 %>%
map(~paste(.,collapse = "")) %>%
map(~gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
tibble()
df2b_2018 <- map(df2_2018, tail, 2) %>%
do.call(rbind,.) %>%
data.frame() %>%
mutate_if(is.character, as.numeric) %>%
mutate_at(vars(-X2), ~ . * 0.1) %>%
tibble()
rm(df2_2018)
df2_2018 <- bind_cols(df2a_2018,
df2b_2018)
names(df2_2018) <- c()
column_names <- c("Group",
"Rate",
"N_2017")
colnames(df2_2018) <- column_names
df2_2018 <- df2_2018 %>%
dplyr::select(-N_2017)
df2_2018 <- df2_2018 %>%
dplyr::mutate(Year = 2018)
df2_2018 <- df2_2018 %>%
mutate(Gender = case_when(str_detect(Group, "Women") ~ TRUE,
str_detect(Group, "Men") ~ FALSE,
TRUE ~ NA))
labels <- unlist(df2_2018[c(seq(1,15, by=3),16,17),1], use.names = FALSE)
dim(df2_2018)[1]
[1] 17
labels_3 <- c(rep(labels[1:5], each = dim(df2_2018)[1]/(length(labels)-2)),
"HMONG",
"CAMBODIAN")
df2_2018$Subgroup <- labels_3
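The rep(each =) call above expands each subgroup label to cover its block of three rows (total, men, women), with HMONG and CAMBODIAN appended separately because they each have a single row. A minimal illustration:

```r
labels_demo <- c("CHINESE", "INDIAN")

# Each label is repeated three times, in order
rep(labels_demo, each = 3)
# returns c("CHINESE", "CHINESE", "CHINESE", "INDIAN", "INDIAN", "INDIAN")
```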
df2_2018 <- df2_2018 %>%
mutate(Type = "Subgroup Total")
glimpse(df2_2018)
Rows: 17
Columns: 6
$ Group <chr> "CHINESE", "Men", "Women", "INDIAN", "Men", "Women", "KOREAN…
$ Rate <dbl> 4.1, 4.5, 3.7, 5.4, 4.7, 6.1, 5.5, 5.6, 5.4, 6.3, 7.6, 5.0, …
$ Year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> "CHINESE", "CHINESE", "CHINESE", "INDIAN", "INDIAN", "INDIAN…
$ Type <chr> "Subgroup Total", "Subgroup Total", "Subgroup Total", "Subgr…
df2_2018 <- df2_2018 %>%
mutate(Gender = case_when(Gender == TRUE ~ "Female",
Gender == FALSE ~ "Male",
TRUE ~ "All"))
df2_2018 <- df2_2018 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, toupper)
df2_2018 <- df2_2018 %>%
dplyr::select(-Group)
The dataframe we produced does not contain totals.
# A tibble: 17 x 5
Rate Year Gender Subgroup Type
<dbl> <dbl> <chr> <chr> <chr>
1 4.1 2018 ALL CHINESE SUBGROUP TOTAL
2 4.5 2018 MALE CHINESE SUBGROUP TOTAL
3 3.7 2018 FEMALE CHINESE SUBGROUP TOTAL
4 5.4 2018 ALL INDIAN SUBGROUP TOTAL
5 4.7 2018 MALE INDIAN SUBGROUP TOTAL
6 6.1 2018 FEMALE INDIAN SUBGROUP TOTAL
7 5.5 2018 ALL KOREAN SUBGROUP TOTAL
8 5.6 2018 MALE KOREAN SUBGROUP TOTAL
9 5.4 2018 FEMALE KOREAN SUBGROUP TOTAL
10 6.3 2018 ALL VIETNAMESE SUBGROUP TOTAL
11 7.6 2018 MALE VIETNAMESE SUBGROUP TOTAL
12 5 2018 FEMALE VIETNAMESE SUBGROUP TOTAL
13 6.8 2018 ALL FILIPINO SUBGROUP TOTAL
14 6.3 2018 MALE FILIPINO SUBGROUP TOTAL
15 7.4 2018 FEMALE FILIPINO SUBGROUP TOTAL
16 10.2 2018 ALL HMONG SUBGROUP TOTAL
17 13.8 2018 ALL CAMBODIAN SUBGROUP TOTAL
We can find these totals in the PDF directly and create rows as needed.
We load the PDF.
We add the rows.
df2_2018 <- df2_2018 %>%
add_row(Rate = 6.2,
Year = 2018,
Gender = "ALL",
Subgroup = "ALL",
Type = "ASIAN TOTAL") %>%
add_row(Rate = 6.1,
Year = 2018,
Gender = "FEMALE",
Subgroup = "ALL",
Type = "GENDER TOTAL") %>%
add_row(Rate = 6.4,
Year = 2018,
Gender = "MALE",
Subgroup = "ALL",
Type = "GENDER TOTAL")We repeat this process again for Latino/a subgroups.
The table, without the special formatting, looks like this.
There are no empty cells in this table, so we can process it as a single image.
We proceed using what we’ve learned while wrangling the first two tables.
df3 <- magick::image_ocr(image3)
df3 %>%
base::strsplit("\n") %>%
base::unlist() %>%
tibble::as_tibble()
# A tibble: 17 x 1
value
<chr>
1 LATINO 13.2 1,157,300
2 Latino Male 12.4 562,600
3 Latina Female 13.9 594,700
4 SOUTH AMERICAN 8.4 37,600
5 South American Male 9.1 20,400
6 South American Female 17 17,200
7 CENTRAL AMERICAN 12.0 93,100
8 Central American Male 9.3 37,900
9 Central American Female 15.0 55,200
10 MEXICAN 13.3 762,400
11 Mexican Male 12.2 358,200
12 Mexican Female 14.4 404,200
13 OTHER LATINO 13.6 44,800
14 Other Latino Male 15.3 27,600
15 Other Latina Female 11.5 17,300
16 PR, DR, Cuban Female 15.7 114,500
17 PR, DR, Cuban Female 14.4 96,600
We are often presented with scenarios where a single standard approach is difficult or time-consuming.
It is always best to document the steps taken to respond to these scenarios. Wrangling this third table is a prime example of this.
The OCR output is missing a row: the overall PR, DR, Cuban total. Let’s add the row manually.
df3 <- df3 %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df3 <- df3 %>%
rbind("PR, DR, Cuban 15.1 211,200")We can now proceed as we did with the previous tables.
df3 <- df3 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")
df3b <- map(df3, tail, 2) %>%
map(~gsub("[,]+|[.]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
mutate_if(is.character, as.numeric) %>%
mutate_at(vars(-X2), ~ . * 0.1) %>%
tibble()
df3a <- df3 %>%
map(~paste(.,collapse = "")) %>%
map(~gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
tibble()
rm(df3)
df3 <- bind_cols(df3a, df3b)
names(df3) <- c()
column_names <- c("Group",
"Rate",
"N_2017")
colnames(df3) <- column_names
If we look at the last few rows, we see that there is a typo: the PR, DR, Cuban group has two Female rows and no Male row.
# A tibble: 6 x 3
Group Rate N_2017
<chr> <dbl> <dbl>
1 OTHERLATINO 13.6 44800
2 OtherLatinoMale 15.3 27600
3 OtherLatinaFemale 11.5 17300
4 PRDRCubanFemale 15.7 114500
5 PRDRCubanFemale 14.4 96600
6 PRDRCuban 15.1 211200
Sometimes when wrangling text data, we will come across a typo. We need to determine how to respond to the typo and make note of it. It’s often best to consult a secondary source to confirm that changes made are accurate.
For the purposes of this case study, we will assume that the first of the two rows represents male disconnection rates in the Latino/a subgroup; this would be consistent with the ordering of genders in the previous subgroups.
Let’s make the correction to the typo.
df3 <- df3 %>%
mutate(Group = case_when(Group == "PRDRCubanFemale" & N_2017 == 114500 ~ "PRDRCubanMale",
TRUE ~ Group))
It looks like we’ve successfully corrected the typo.
# A tibble: 6 x 3
Group Rate N_2017
<chr> <dbl> <dbl>
1 OTHERLATINO 13.6 44800
2 OtherLatinoMale 15.3 27600
3 OtherLatinaFemale 11.5 17300
4 PRDRCubanMale 15.7 114500
5 PRDRCubanFemale 14.4 96600
6 PRDRCuban 15.1 211200
We can continue with the process we’ve developed now that we have made the correction.
df3 <- df3 %>%
dplyr::select(-N_2017)
df3 <- df3 %>%
dplyr::mutate(Year = 2017)
df3 <- df3 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ TRUE,
str_detect(Group, "Male") ~ FALSE,
TRUE ~ NA),
Subgroup = str_remove_all(Group,
pattern = paste(c("Female",
"Male",
"LATINO",
"Latino",
"Latina"),
collapse = "|"))) %>%
mutate(Subgroup = na_if(Subgroup, ""))
glimpse(df3)
Rows: 18
Columns: 5
$ Group <chr> "LATINO", "LatinoMale", "LatinaFemale", "SOUTHAMERICAN", "So…
$ Rate <dbl> 13.2, 12.4, 13.9, 8.4, 9.1, 1.7, 12.0, 9.3, 15.0, 13.3, 12.2…
$ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> NA, NA, NA, "SOUTHAMERICAN", "SouthAmerican", "SouthAmerican…
df3 <- df3 %>%
mutate(Type = case_when(is.na(Gender) & is.na(Subgroup) ~ "Latino/a Total",
is.na(Gender) & !is.na(Subgroup) ~ "Subgroup Total",
!is.na(Gender) & is.na(Subgroup) ~ "Gender Total",
TRUE ~ "Subgroup Total"))
glimpse(df3)
Rows: 18
Columns: 6
$ Group <chr> "LATINO", "LatinoMale", "LatinaFemale", "SOUTHAMERICAN", "So…
$ Rate <dbl> 13.2, 12.4, 13.9, 8.4, 9.1, 1.7, 12.0, 9.3, 15.0, 13.3, 12.2…
$ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> NA, NA, NA, "SOUTHAMERICAN", "SouthAmerican", "SouthAmerican…
$ Type <chr> "Latino/a Total", "Gender Total", "Gender Total", "Subgroup …
df3 <- df3 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ "Female",
str_detect(Group, "Male") ~ "Male"))
df3 <- df3 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, toupper)
df3 <- df3 %>%
mutate(Gender = replace_na(Gender, "ALL"),
Subgroup = replace_na(Subgroup, "ALL")) %>%
dplyr::select(-Group) %>%
mutate(Subgroup)
df3 <- df3 %>%
mutate(Subgroup = case_when(Subgroup == "SOUTHAMERICAN" ~ "SOUTH AMERICAN",
Subgroup == "CENTRALAMERICAN" ~ "CENTRAL AMERICAN",
Subgroup == "PRDRCUBAN" ~ "PR/DR/CUBAN",
TRUE ~ Subgroup))
Let’s add the 2018 data to this dataframe.
We import the image.
image5 <- image_read(here("img", "latino_a_subgroups_2018.png"))
df3_2018 <- magick::image_ocr(image5)
As you can see, we have repeated newlines (\n). We can remove these with some simple regex.
We proceed, making slight modifications to the process as needed.
df3_2018 <- df3_2018 %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df3_2018 <- df3_2018 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")
df3_2018 <- lapply(df3_2018,
function(x) x[nchar(x) >= 1])
df3a_2018 <- df3_2018 %>%
map(~paste(.,collapse = "")) %>%
map(~gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
tibble()
df3b_2018 <- map(df3_2018, tail, 2) %>%
do.call(rbind,.) %>%
data.frame() %>%
mutate_if(is.character, as.numeric) %>%
mutate_at(vars(-X2), ~ . * 0.1) %>%
tibble()
rm(df3_2018)
df3_2018 <- bind_cols(df3a_2018,
df3b_2018)
names(df3_2018) <- c()
column_names <- c("Group",
"Rate",
"N_2017")
colnames(df3_2018) <- column_names
df3_2018 <- df3_2018 %>%
dplyr::select(-N_2017)
df3_2018 <- df3_2018 %>%
dplyr::mutate(Year = 2018)
df3_2018 <- df3_2018 %>%
mutate(Gender = case_when(str_detect(Group, "Women") ~ TRUE,
str_detect(Group, "Men") ~ FALSE,
TRUE ~ NA))
labels <- unlist(df3_2018[seq(1,12, by =3),1], use.names = FALSE)
dim(df3_2018)[1]
[1] 12
labels_3 <- rep(labels, each = dim(df3_2018)[1]/length(labels))
df3_2018$Subgroup <- labels_3
df3_2018 <- df3_2018 %>%
mutate(Type = "Subgroup Total")
glimpse(df3_2018)
Rows: 12
Columns: 6
$ Group <chr> "SOUTHAMERICAN", "Men", "Women", "MEXICAN", "Men", "Women", …
$ Rate <dbl> 8.0, 7.5, 8.6, 12.9, 12.0, 13.8, 13.7, 14.9, 12.4, 13.7, 11.…
$ Year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> "SOUTHAMERICAN", "SOUTHAMERICAN", "SOUTHAMERICAN", "MEXICAN"…
$ Type <chr> "Subgroup Total", "Subgroup Total", "Subgroup Total", "Subgr…
df3_2018 <- df3_2018 %>%
mutate(Gender = case_when(Gender == TRUE ~ "Female",
Gender == FALSE ~ "Male",
TRUE ~ "All"))
df3_2018 <- df3_2018 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, toupper)
df3_2018 <- df3_2018 %>%
dplyr::select(-Group)
df3_2018 <- df3_2018 %>%
mutate(Subgroup = case_when(Subgroup == "SOUTHAMERICAN" ~ "SOUTH AMERICAN",
Subgroup == "CENTRALAMERICAN" ~ "CENTRAL AMERICAN",
Subgroup == "PRDRCUBAN" ~ "PR/DR/CUBAN",
TRUE ~ Subgroup))
We load the PDF.
We add the rows.
df3_2018 <- df3_2018 %>%
add_row(Rate = 12.8,
Year = 2018,
Gender = "ALL",
Subgroup = "ALL",
Type = "LATINO/A TOTAL") %>%
add_row(Rate = 13.3,
Year = 2018,
Gender = "FEMALE",
Subgroup = "ALL",
Type = "GENDER TOTAL") %>%
add_row(Rate = 12.3,
Year = 2018,
Gender = "MALE",
Subgroup = "ALL",
Type = "GENDER TOTAL")We add the 2018 data to the dataframe
We will use multiple images to import the data on page 36 to produce maps.
MICHAEL ONTIVEROS
This code is not complete. If there is time, we will return to it! Magick is having trouble with quadrants 1 and 4, and I could not figure out why.
quadrant1 <- image_read(here("img", "state_quadrant1.png"))
quadrant2 <- image_read(here("img", "state_quadrant2.png"))
quadrant3 <- image_read(here("img", "state_quadrant3.png"))
quadrant4 <- image_read(here("img", "state_quadrant4.png"))
quadrant1 <- magick::image_ocr(quadrant1)
quadrant2 <- magick::image_ocr(quadrant2)
quadrant3 <- magick::image_ocr(quadrant3)
quadrant4 <- magick::image_ocr(quadrant4)
labels_quad1_3 <- paste0(quadrant1, quadrant3)
labels_quad2_4 <- paste0(quadrant2, quadrant4)
labels_quad1_3 <- labels_quad1_3 %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
labels_quad2_4 <- labels_quad2_4 %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df1 <- df1 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")data_map <- map_data("state") %>%
filter(region=="california")
years <- c(seq(2008,2016,by=2),2017)
index_rep <- dim(data_map)[1]
data_map <- bind_rows(replicate(length(years), data_map, simplify = FALSE))
data_map$year <- rep(years, each = index_rep)
data_map <- data_map %>%
group_by(region, year) %>%
mutate(rank_ran = rank(year, ties.method = "random"))
data_map <- data_map[order(data_map$order),]
library(gganimate)
ggplot(data_map, aes(x = long, y = lat, group = group, fill=rank_ran)) +
geom_polygon() +
transition_time(time = year)
Repeated Cross-sectional Data
We have pooled (repeated) cross-sectional data.
This is data produced from repeated measurement of a population over time.
It is often infeasible to collect data for an entire population at once. However, we can still obtain meaningful measures using a random sample of the population.
At specific time-points, data is collected from a sample of the population. The individuals in each sample are not necessarily the same individuals. This separates pooled cross-sectional data from panel data, which is longitudinal data from repeated measurement of the same people.
By sampling from a population at multiple time points, we can generate population-level statistics. Although these statistics have some random error, they can provide insight into how the measured variable is changing in a population over time.
We can accomplish this by plotting the measured values over time. Sometimes, however, the trend isn’t exactly clear. Fortunately, there are statistical methods to resolve this issue.
The Mann-Kendall trend test, a variation of the Kendall rank correlation coefficient, tests whether there is a monotonic trend: one in which values consistently increase or consistently decrease over time, though not necessarily at a constant rate.
Recall the youth disconnection rates for Native Americans, some of the highest in the first table we examined.
Let’s conduct a Mann-Kendall test for trend.
We can accomplish this with the Kendall::MannKendall() function, which accepts a vector of data in time order. According to its documentation on CRAN, it performs a “test for a monotonic trend in a time series”.
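As a quick sanity check of the function (assuming the Kendall package is installed), a strictly increasing toy series should give tau = 1, since every pair of observations is concordant:

```r
library(Kendall)

# A strictly increasing series: a perfect monotonic upward trend
MannKendall(c(1, 3, 4, 7, 9))
# tau = 1
```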
df1 %>%
filter(Gender == "ALL",
Race == "NATIVE AMERICAN") %>%
pull(Rate) %>%
MannKendall(.) %>%
summary()
Score = -7 , Var(Score) = 28.33333
denominator = 15
tau = -0.467, 2-sided pvalue =0.25966
There does not appear to be a statistically significant monotonic trend (p = 0.26). However, it’s important to note that we only have 6 observations, which limits our power to detect a trend.
We can also explore the trend using simple linear regression.
df1 %>%
filter(Gender == "ALL",
Race == "NATIVE AMERICAN") %>%
lm(Rate ~ Year, data = .) %>%
summary()
Call:
lm(formula = Rate ~ Year, data = .)
Residuals:
1 2 3 4 5 6
-2.4332 2.2978 0.8288 0.4597 0.2907 -1.4438
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 359.1159 487.3956 0.737 0.502
Year -0.1655 0.2421 -0.683 0.532
Residual standard error: 1.889 on 4 degrees of freedom
Multiple R-squared: 0.1045, Adjusted R-squared: -0.1193
F-statistic: 0.467 on 1 and 4 DF, p-value: 0.5319
For each one-year increase, the estimated mean change in disconnection rate is -0.1655 percentage points, a slight decrease. This relationship is not statistically significant (p = 0.53). Again, we are largely limited by the number of observations in this dataset.
We can visualize the relationship above.
df1 %>%
filter(Gender == "ALL",
Race == "NATIVE AMERICAN") %>%
ggplot(aes(x = Year, y = Rate)) +
geom_smooth(method = "lm", color = "red") +
geom_point() +
scale_x_continuous(breaks = seq(2008, 2018, by = 1),
labels = seq(2008, 2018, by = 1),
limits = c(2008, 2018)) +
theme_minimal() +
labs(title = "Youth Disconnection Rates of Native American Youth",
subtitle = "2008 - 2017",
x = "Year",
y = "Disconnection Rate")As we can see, there is a large amount of uncertainty around the fitted line.
Let’s visualize the data!
Let’s reproduce the example figure below.
We can create a version of this example with ggplot2 from the tidyverse.
There are websites online for identifying colors, such as this.
Using one of these websites, we identify the hex triplet for the color used in the visualization included in the PDF: #008393.
fa_figurine <- image_read("https://upload.wikimedia.org/wikipedia/commons/7/7c/User_font_awesome.svg")
fa_figurine <- image_fill(fa_figurine,
color = "#008393",
point = "+800+800",
fuzz = 0)
fa_figurine <- image_fill(fa_figurine,
color = "#008393",
point = "+800+1000",
fuzz = 0)
df1 %>%
filter(Type == "RACE TOTAL") %>%
ggplot(aes(x = Year, y = Rate, group=Race)) +
geom_line(color = "#008393", size = 0.5) +
geom_point(color = "#008393", size = 3) +
scale_x_continuous(breaks = seq(2008,2018, by=1),
limits = c(2008,2018)) +
scale_y_continuous(breaks = seq(5,30, by =5),
limits = c(5,30)) +
draw_image(fa_figurine, x = 2017, y = 23.5, scale = 2) +
draw_image(fa_figurine, x = 2017, y = 17.5, scale = 2) +
draw_image(fa_figurine, x = 2017, y = 13, scale = 2) +
draw_image(fa_figurine, x = 2017, y = 9, scale = 2) +
draw_image(fa_figurine, x = 2017, y = 6.5, scale = 2) +
labs(title = "FIGURE 1 YOUTH DISCONNECTION BY RACE AND ETHNICITY, 2008 - 2017",
y = "YOUTH DISCONNECTION (%)") +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank())
We can build off of this idea, using a custom color palette to create a gradient based on the color used.
custom_pal <- colorRampPalette(c("white", "#008393"))
gender_n <- 3
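Note that colorRampPalette() returns a function: calling it with an integer n yields n hex colors interpolated between the endpoints. A quick illustration:

```r
custom_pal <- colorRampPalette(c("white", "#008393"))

# Three colors: white, an intermediate tint, and the teal endpoint
custom_pal(3)
# first color is "#FFFFFF", last is "#008393"
```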
asian_total <- df2 %>%
filter(Year == 2017) %>%
filter(Gender == "ALL",
Subgroup == "ALL") %>%
pull(Rate)
df2 %>%
filter(Year == 2017) %>%
complete(Gender, Subgroup) %>%
group_by(Gender) %>%
mutate(sub_rank = rank(Rate, ties.method = "min")) %>%
group_by(Subgroup) %>%
mutate(rank_all = sub_rank[Gender == "ALL"]) %>%
ungroup() %>%
mutate(Subgroup = fct_reorder(Subgroup, rank_all)) %>%
ggplot(aes(x = Subgroup, y = Rate, fill = Gender)) +
geom_hline(yintercept = asian_total,
color = "black",
linetype = 2) +
geom_bar(stat = "identity",
color = "transparent",
size = 0.5,
position = "dodge",
width = 0.5) +
labs(title = "FIGURE X YOUTH DISCONNECTION BY ASIAN SUBGROUP, 2017",
subtitle = "ORDERED BY OVERALL DISCONNECTION",
y = "YOUTH DISCONNECTION (%)",
fill = "Gender") +
scale_fill_manual(values = rev(custom_pal(gender_n + 1))) +
scale_y_continuous(breaks = seq(0,20,2),
labels = seq(0,20,2),
limits = c(0,20)) +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1)) +
annotate("text", label = 'bold("ASIAN TOTAL")',
color = "#008393",
size = 3,
x = 1.2,
y = asian_total + 1,
parse = TRUE)
From the above plot, it becomes apparent that the Hmong subgroup makes up a small proportion of the total number of Asian disconnected youth: the Asian total youth disconnection rate is more similar to the rates of the other subgroups than to the Hmong rate.
We can confirm this by revisiting the table.
The Hmong group represents 6% of all Asian disconnected youth.
This shows the importance of adding small details such as the composite line to plots. It helps provide a simple yet nuanced picture of what is going on.
Lastly, we can add annotations to provide even more depth to the visualization.
latino_a_total <- df3 %>%
filter(Year == 2017) %>%
filter(Gender == "ALL",
Subgroup == "ALL") %>%
pull(Rate)
df3 %>%
filter(Year == 2017) %>%
complete(Gender, Subgroup) %>%
group_by(Gender) %>%
mutate(sub_rank = rank(Rate, ties.method = "min")) %>%
group_by(Subgroup) %>%
mutate(rank_all = sub_rank[Gender == "ALL"]) %>%
ungroup() %>%
mutate(Subgroup = fct_reorder(Subgroup, rank_all)) %>%
ggplot(aes(x = Subgroup, y = Rate, fill = Gender)) +
geom_hline(yintercept = latino_a_total,
color = "black",
linetype = 2) +
geom_bar(stat = "identity",
color = "transparent",
size = 0.5,
position = "dodge",
width = 0.5) +
labs(title = "FIGURE X YOUTH DISCONNECTION BY LATINO/A SUBGROUP, 2017",
subtitle = "ORDERED BY OVERALL DISCONNECTION",
y = "YOUTH DISCONNECTION (%)",
fill = "Gender") +
scale_fill_manual(values = rev(custom_pal(gender_n + 1))) +
scale_y_continuous(breaks = seq(0,20,2),
labels = seq(0,20,2),
limits = c(0,20)) +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1)) +
annotate("text", label = 'bold("LATINO TOTAL")',
color = "#008393",
size = 3,
x = 1.2,
y = latino_a_total + 1,
parse = TRUE)
df2 %>%
complete(Gender, Subgroup, Year) %>%
group_by(Subgroup) %>%
mutate(missing = sum(is.na(Rate))) %>%
filter(missing == 0) %>%
dplyr::select(-missing) %>%
ungroup() %>%
group_by(Gender) %>%
mutate(sub_rank = rank(Rate, ties.method = "min")) %>%
group_by(Subgroup, Year) %>%
mutate(rank_all = sub_rank[Gender == "ALL"]) %>%
ungroup() %>%
group_by(Year) %>%
mutate(threshold = Rate[Gender == "ALL" & Subgroup == "ALL"]) %>%
ungroup() %>%
mutate(Subgroup = fct_reorder(Subgroup, rank_all)) %>%
ggplot(aes(x = Subgroup, y = Rate, fill = Gender)) +
facet_wrap(Year ~., ncol = 1) +
geom_hline(aes(yintercept = threshold), linetype = 2) +
geom_bar(stat = "identity",
color = "transparent",
size = 0.5,
position = "dodge",
width = 0.5) +
labs(title = "FIGURE X YOUTH DISCONNECTION BY ASIAN SUBGROUP, 2017-2018",
subtitle = "ORDERED BY OVERALL DISCONNECTION",
y = "YOUTH DISCONNECTION (%)",
fill = "Gender") +
scale_fill_manual(values = rev(custom_pal(gender_n + 1))) +
scale_y_continuous(breaks = seq(0,10,2),
labels = seq(0,10,2),
limits = c(0,10)) +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1))
df3 %>%
complete(Gender, Subgroup, Year) %>%
group_by(Subgroup) %>%
mutate(missing = sum(is.na(Rate))) %>%
filter(missing == 0) %>%
dplyr::select(-missing) %>%
ungroup() %>%
group_by(Gender) %>%
mutate(sub_rank = rank(Rate, ties.method = "min")) %>%
group_by(Subgroup, Year) %>%
mutate(rank_all = sub_rank[Gender == "ALL"]) %>%
ungroup() %>%
group_by(Year) %>%
mutate(threshold = Rate[Gender == "ALL" & Subgroup == "ALL"]) %>%
ungroup() %>%
mutate(Subgroup = fct_reorder(Subgroup, rank_all)) %>%
ggplot(aes(x = Subgroup, y = Rate, fill = Gender)) +
facet_wrap(Year ~., ncol = 1) +
geom_hline(aes(yintercept = threshold), linetype = 2) +
geom_bar(stat = "identity",
color = "transparent",
size = 0.5,
position = "dodge",
width = 0.5) +
labs(title = "FIGURE X YOUTH DISCONNECTION BY LATINO/A SUBGROUP, 2017",
subtitle = "ORDERED BY OVERALL DISCONNECTION",
y = "YOUTH DISCONNECTION (%)",
fill = "Gender") +
scale_fill_manual(values = rev(custom_pal(gender_n + 1))) +
scale_y_continuous(breaks = seq(0,20,2),
labels = seq(0,20,2),
limits = c(0,20)) +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1))
The concepts listed here must be revisited.
Terms and concepts covered:
Tidyverse
RStudio cheatsheets
Inference
Regression
Different types of regression
Ordinary least squares method
Residual
Packages used in this case study:
| Package | Use |
|---|---|
| here | to easily load and save data |
| tidyverse | for data science operations |
| pdftools | to manage PDF documents |
| magick | for image processing |
We would like to acknowledge Tamar Mendelson for assisting in framing the major direction of the case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.